Frameworks for estimating causal effects in observational settings: comparing confounder adjustment and instrumental variables

To estimate causal effects, analysts performing observational studies in health settings utilize several strategies to mitigate bias due to confounding by indication. There are two broad classes of approaches for these purposes: use of confounders and instrumental variables (IVs). Because such approaches are largely characterized by untestable assumptions, analysts must operate under an indefinite paradigm that these methods will work imperfectly. In this tutorial, we formalize a set of general principles and heuristics for estimating causal effects in the two approaches when the assumptions are potentially violated. This crucially requires reframing the process of observational studies as hypothesizing potential scenarios where the estimates from one approach are less inconsistent than the other. While most of our discussion of methodology centers around the linear setting, we touch upon complexities in non-linear settings and flexible procedures such as targeted minimum loss-based estimation and double machine learning. To demonstrate the application of our principles, we investigate the use of donepezil off-label for mild cognitive impairment. We compare and contrast results from confounder and IV methods, traditional and flexible, within our analysis and to a similar observational study and clinical trial.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12874-023-01936-2.


different populations or use cases. In addition, observational studies give researchers real world evidence surrounding the effectiveness and safety of a treatment to augment RCT findings (e.g. Phase 4 trials).
On the path to isolating causal effects, observational studies must address potential bias due to the non-randomization of the intervention. A common source of bias is confounding by indication, or treatment selection bias, where factors affect both the assignment of treatment and the targeted medical condition. These factors, called confounders, range from patient characteristics to other concurrent treatments.
There are two broad classes of approaches to mitigate treatment selection bias based on confounding variables and instrumental variables (IVs). Briefly, confounder approaches aim to "adjust" for all factors that both explain treatment assignment and the outcome. In contrast, IVs determine only the assignment of the treatment but, otherwise, are not associated with the outcome. IVs are used to define a subset of the population whose treatment assignment is free from confounding.
Fundamentally, confounder and IV approaches are characterized by untestable assumptions in practice. For example, the notion that all possible confounders have been adjusted for cannot be verified with data. Therefore, analysts must be able to operate under the assumption that these methods will work imperfectly. In other words, neither approach will fully overcome treatment selection bias but can provide a less biased estimate than had they not been used. Navigating this indefinite paradigm requires a general set of reasoning and intuition surrounding observational studies.
In this paper, we formalize a set of general principles and heuristics for estimating causal effects under treatment selection bias. Stemming from the two approaches, we outline three general steps. Firstly, one must be able to identify potential confounders and IVs in the scientific context of the study. Secondly, looking at the available dataset, one must judge which variables identified in step one are present, somewhat present (e.g. proxies), or missing. Thirdly, one must weigh the pros and cons of each methodology to consider which one could provide a more reasonable causal estimate. Our discussion is led by the criteria of internal validity (the ability of the approach to estimate the causal effect), external validity (the ability of the approach to generalize to relevant populations), and reproducibility (the ability of the results to be replicated in similar studies).
Throughout the paper, we will generally assume that the relationships in the data are linear in functional form so we may focus on issues related to unobserved confounding as opposed to model misspecification. We relax this linearity assumption when we speak about causal inference in non-linear settings and flexible modeling approaches.
To demonstrate the use of the principles outlined in the paper, as an illustrative example, we investigate the use of donepezil (brand name: Aricept) in mild cognitively impaired patients (MCI) to mitigate cognitive decline due to dementia. In the present day, donepezil is indicated for mild to severe Alzheimer's Disease (AD) but not MCI due to failure to show efficacy in clinical trials [1][2][3]. Nevertheless, study of donepezil and related compounds in MCI has continued in both observational studies of off-label practice [4,5] and clinical trials [6,7].
We analyze data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), an observational, multi-center, natural history dataset that tracks cognitively normal, MCI, and AD subjects over time [8]. Because the use of donepezil is non-randomized in ADNI, there may be confounding by indication where MCI patients who are prescribed donepezil are potentially suffering from more severe cognitive decline than those who are not. Under this setting, we will compare confounder and IV model estimates to each other and to those of related observational studies and clinical trials.
In the literature, there exist other tutorials and textbooks for causal inference methodology. In particular, Baiocchi, Cheng, and Small give a detailed overview of IVs in health research and, in section two of their paper, share a similar philosophy to this paper in that choosing between confounder and IV approaches includes weighing unmeasured confounding against the hypothesized validity of the IV [9]. Nevertheless, the scope of their tutorial is a pedagogical overview of IVs whereas ours incorporates this information as part of a larger goal to provide detailed guidelines and heuristics for conducting observational studies. As such, we cover a broader range of topics such as the interaction between the confounder and IV approaches, comparing confounder adjustment and propensity score weighting methods, non-collapsibility, and external validity, with each topic centered around potential assumption violations. We further include a full data analysis with a detailed comparison and discussion of each approach in order to clearly demonstrate the discussed principles.
Another important contribution of our tutorial is that we cover the recent trend of using machine learning (ML) for causal inference such as targeted minimum loss-based estimation (TMLE) and double machine learning (DML) [10][11][12]. As these tools are now widely used in applied data analyses, any modern tutorial on causal inference methodology should include commentary on this topic. By first outlining general principles for causal inference in observational studies with "traditional" methodology, our tutorial can more effectively discuss the potential benefits and pitfalls of "data-driven" modeling for estimating causal effects. Furthermore, our applied data analysis, which includes the use of ML methods, provides a further critical evaluation of the novel ML methods. There exists little in the literature by way of practical comparisons and guidelines of traditional and novel methodology across confounder and IV approaches. Angrist and Frandsen consider this topic but focus on economic applications, and the work is limited in terms of methodology utilized relative to this review [13].
The remainder of the paper is organized as follows. We begin by outlining the assumptions related to confounder adjustment and IV methods as well as the consequences of violating each assumption. We then describe common methodologies for confounder adjustment and IVs and their machine learning extensions, discussing further assumptions and considerations. Next, we outline scenarios where one approach is preferable to another. After presenting our analysis of the ADNI donepezil data and related remarks, we conclude with an overall discussion of the content of this paper.
Zawadzki et al. BMC Medical Research Methodology (2023) 23:122

Two approaches: confounders and instruments
We begin by considering the simple scenario depicted in Fig. 1. Such figures are called directed acyclic graphs (DAGs), where the nodes are variables, the edges represent directed causal effects, and the Greek letters represent the magnitude of the causal effect of the respective edge. Contextually, D is an indicator for which treatment was prescribed, Y represents the outcome of interest, U represents a confounding variable, and Z is an IV. In the donepezil example, D indicates whether donepezil was prescribed and Y is cognitive function as measured by the Alzheimer's Disease Assessment Scale (ADAS-cog) [14].
Firstly, we must define the causal estimand of interest using the potential outcomes framework [15]. Let Y_{i,t}(1) be the potential outcome of individual i at time t had the individual taken the treatment (D = 1) and Y_{i,t}(0) be the potential outcome at time t had the individual not taken the treatment (D = 0). In contrast, Y_t is the observed outcome at time t. Therefore, in the situation characterized by Fig. 1, we want to isolate β, the average treatment effect (ATE). In the donepezil example, the ATE would be the change from baseline in ADAS-cog after two years, formalized below with t = 0 being baseline:
$$\beta = E\left[Y_{i,t=2}(1) - Y_{i,t=0}\right] - E\left[Y_{i,t=2}(0) - Y_{i,t=0}\right]$$
We cannot observe both potential outcomes for any individual and thus we must use observed values from those who were prescribed either treatment option. Under the stable unit treatment values assumption (SUTVA), treatment assignment ignorability (i.e. Y(0), Y(1) ⊥ D), and positivity (0 < P(D = d) < 1 with d = 0, 1), we could use the following estimand, suppressing subscripts:
$$\beta = E[Y(1)] - E[Y(0)] = E[Y(1) \mid D = 1] - E[Y(0) \mid D = 0] = E[Y \mid D = 1] - E[Y \mid D = 0]$$
Throughout this paper, we focus on mitigating issues when the assumption of ignorability fails to hold. Graphically, the θ and η edge weights in Fig. 1 represent the magnitude of U's influence on the assignment of donepezil and ADAS-cog, respectively. Clearly we have a violation because U is a confounder when θ ≠ 0 and η ≠ 0, so that Y(0), Y(1) are no longer independent of D. Simply estimating the difference in the group means (the final expression) will not recover β because the second equality fails. In short, indirect paths from D to Y (via U) pose major issues in obtaining causal estimates.
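The failure of the simple group-mean contrast under confounding can be illustrated with a small simulation. The following is a minimal sketch, not the analysis used in this paper: the function name `naive_contrast`, the threshold model for treatment assignment, and all parameter values are assumptions chosen only to mimic the structure of Fig. 1 without an instrument.

```python
import numpy as np

def naive_contrast(theta, eta, beta=2.0, n=200_000, seed=0):
    """Difference in group means when U drives treatment (theta) and outcome (eta)."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)                                   # unobserved confounder U
    # Assumed assignment rule: treatment is more likely when theta * U is large.
    d = (theta * u + rng.normal(size=n) > 0).astype(float)
    y = beta * d + eta * u + rng.normal(size=n)              # outcome with true effect beta
    return y[d == 1].mean() - y[d == 0].mean()

confounded = naive_contrast(theta=1.0, eta=1.5)    # U affects both D and Y
unconfounded = naive_contrast(theta=0.0, eta=1.5)  # U no longer affects D
print(f"true effect 2.00, confounded {confounded:.2f}, unconfounded {unconfounded:.2f}")
```

When θ is set to zero the contrast sits close to the true effect, whereas with both θ and η non-zero it is pushed well away from it, in a direction determined by the signs of the two edge weights.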
There are two ways we will describe how the confounder approach can isolate β. First, we need to find a set of variables X such that conditional ignorability, D ⊥ Y(0), Y(1) | X, is achieved. Alternatively, we need to condition upon variables X such that all "backdoor" paths from D to Y are blocked [16]. Backdoor paths are non-causal, indirect paths that connect D and Y. Visually, if we think of a DAG as a set of pipes, blocking all backdoor paths restricts the flow of water or "information" to only go through the pipe flowing from D to Y. Hence, we recover our direct causal effect. These two notions of confounding lead us to conclude that in Fig. 1 it is sufficient and necessary to condition on U and use the finite sample estimator of E[Y | D = 1, U] − E[Y | D = 0, U] to obtain β. Note that there are many other descriptions of confounding; we focus on the conditional ignorability and graphical definitions [17,18].
Alternatively, we may identify the treatment effect through an IV denoted as Z. For simplicity, we assume for now that only one IV is sufficient. Briefly, the definition of an IV is that Z must influence D (relevance), or α ≠ 0, must not cause Y conditioning on X (exclusion restriction), and must not be associated with any unobserved confounders (independence). In Fig. 1, Z is an IV because α ≠ 0 and there are no other arrows in or out of Z that go to Y. Figure 2 demonstrates how this latter notion can be violated if any of δ, ε, or φ are non-zero. In this case, Z is, in fact, a confounder, but if δ = 0 and we are able to condition upon U, then Z is re-classified as an IV. Compared with the confounder approach, if we have access to an IV, we may be able to obtain a causal estimate without having to account for all possible confounders, which is a major potential advantage of using IVs over confounder adjustment methods.
The building of a valid causal network requires knowledge of both the variables in the network and the arrows between them. Unfortunately, we cannot be sure that a posited DAG is correct using data. For example, to suggest unobserved confounding requires knowledge that goes beyond the dataset at hand. Nevertheless, we can still use DAGs to capture assumption violations as clearly as possible and move towards the best option in a given scenario. By first thinking outside the scope of the current dataset, one captures a fuller picture of the study.
The untestability of causal assumptions suggests that users of the confounder and IV approaches must think relatively and not in absolutes. For example, rather than arguing that a set of confounders is sufficient for conditional ignorability, one should instead find confounders to condition upon that potentially bring the estimate closer to the true causal estimand. For IVs, rather than debating whether we have a true IV or not, we can think about how strongly the treatment is identified relative to potential violations in the assumptions of Z and the hypothesized overall magnitude of confounding.

Identifying and using confounders
Without directly modeling the response, in order to avoid multiplicity bias, a reasonable strategy to identify confounders is to first postulate variables that affect the response and then distinguish which of these variables may influence treatment assignment [19]. In the latter step, caution must be taken with the direction of causality: mistakenly adjusting for mediating variables on the pathway from D to Y may produce unintended consequences such as attenuation of the estimated treatment effect. To see this, consider Fig. 3, a scenario where W is a mediating variable. A simple example of a mediator is the ADAS-cog measurement at one year, Y_{t=1}. Adjusting for Y_{t=1} will decrease a possible treatment effect because the measurement at one year is on the causal pathway to Y_{t=2} and we have blocked this path.
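The attenuation from adjusting for a mediator can be seen in a short simulation. This is a hedged sketch under assumed linear relationships: the helper `ols`, the randomized treatment, and the coefficient values (a direct effect of 0.5 and an indirect effect of 1.0 × 1.5 through W) are illustrative assumptions, not quantities from the ADNI analysis.

```python
import numpy as np

def ols(columns, y):
    """Least-squares coefficients for y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(1)
n = 100_000
d = rng.integers(0, 2, size=n).astype(float)   # randomized treatment, so no confounding
w = 1.0 * d + rng.normal(size=n)               # mediator W, e.g. an intermediate outcome
y = 0.5 * d + 1.5 * w + rng.normal(size=n)     # direct effect 0.5, indirect effect 1.5

total_effect = ols([d], y)[1]        # approximately 0.5 + 1.5 = 2.0
adjusted_effect = ols([d, w], y)[1]  # approximately 0.5, the path through W is blocked
print(f"unadjusted {total_effect:.2f}, adjusted for mediator {adjusted_effect:.2f}")
```

The unadjusted coefficient recovers the total effect of treatment, while the coefficient adjusted for W retains only the portion that does not flow through the mediator.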
The first and strongest reference for determining causal links for confounders should be the underlying scientific mechanism. Such information can be based on prior basic science, epidemiological findings, and historical trials. These sources often help one to identify a vast majority of relevant confounding factors. Another source but of potentially lesser quality is past empirical studies done on predictors of the response. One should assess the quality of these studies in terms of replicability, precision, and study design before choosing to use the associated information. This information can additionally be used to identify suitable proxies for key confounders.
Given a set of potential confounders, it may not always be advantageous to select all of them in the data analysis. In reality, much of the confounding may be captured by a few variables such as basic demographics (e.g. age and sex), commonly collected lifestyle factors (e.g. smoking and alcohol use), and comorbidities (e.g. chronic disease and corresponding medication use). With each confounder included in the analysis, we must weigh moving towards conditional ignorability against overfitting (i.e. increased imprecision and Type II error), interpretability, and reproducibility. Consider that under non-linearity, adjusting for confounders may change the interpretation of the estimate of the treatment effect [18]. In the linear setting, a similar scenario can occur under treatment effect heterogeneity, which is when the effect of the treatment differs across the values of one or more factors [20].

Identifying and using instrumental variables
Unlike confounders, the use of IVs is not as straightforward and often requires more technical knowledge to employ effectively, so we first provide more background and intuition. The crux of the IV approach is that we use variation independent from confounding to identify treatment assignment. Because IVs, by definition, cannot be determined by other variables in the causal paradigm (there are exceptions: for example, see Fig. 4), we can assume that the IV values are "as good as" randomized in that they are not influenced by conditioned and unobserved confounders. As a result, the values of the treatment assignment generated by IVs, denoted D_IV, are also randomized. It follows that the estimate using D_IV, β_IV, is theoretically free of unobserved confounding. Complexity arises in using IVs mainly because D and D_IV are not technically the same variable. This means that β_IV is a different estimand than β and the IVs cannot be used to directly calculate the ATE but, rather, the "local average treatment effect" (LATE), where we are "local" to variation in the IVs [21]. Fortunately, if there is no treatment effect heterogeneity and the assumptions for an IV are met, β_IV is consistent for β.
As a simple demonstration of the above notions, the form of the LATE with a binary IV and binary treatment can be given by the Wald estimand in Eq. 1:
$$\beta_{IV} = \frac{E[Y \mid Z = 1] - E[Y \mid Z = 0]}{E[D \mid Z = 1] - E[D \mid Z = 0]} \tag{1}$$
Besides adding more intuition behind IV-derived treatment effects, this equation introduces the importance of IV strength or "predictive power," captured by α in Fig. 1. Heuristically, when α is small the IV is "weak" and when α is sufficiently large the IV is "strong." In the linear setting, the finite sample bias of β_IV is partially a function of α, where we incur large bias for β with small values of α [22]. For instance, in Eq. 1, a small α makes the denominator small, so the fraction is inflated.
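To make the Wald ratio and the role of instrument strength concrete, the sketch below computes the ratio of the difference in mean outcome to the difference in mean treatment across the two values of a binary instrument. It is a minimal illustration only: the data-generating process in `simulate`, the constant treatment effect of 2.0, and the chosen values of α are assumptions.

```python
import numpy as np

def wald_estimate(y, d, z):
    """Difference in mean outcome over difference in mean treatment, by instrument value."""
    numerator = y[z == 1].mean() - y[z == 0].mean()
    denominator = d[z == 1].mean() - d[z == 0].mean()
    return numerator / denominator

def simulate(alpha, n, seed, beta=2.0):
    """Assumed process: Z shifts treatment with strength alpha; U confounds D and Y."""
    rng = np.random.default_rng(seed)
    z = rng.integers(0, 2, size=n)
    u = rng.normal(size=n)
    d = (alpha * z + u + rng.normal(size=n) > 0.5 * alpha).astype(float)
    y = beta * d + 1.5 * u + rng.normal(size=n)
    return y, d, z

def median_abs_error(alpha, reps=200, n=2_000, beta=2.0):
    """Median absolute error of the Wald estimate over repeated small samples."""
    errors = [abs(wald_estimate(*simulate(alpha, n, seed)) - beta) for seed in range(reps)]
    return float(np.median(errors))

strong_estimate = wald_estimate(*simulate(alpha=2.0, n=200_000, seed=0))
error_strong = median_abs_error(alpha=2.0)
error_weak = median_abs_error(alpha=0.1)
print(f"large-sample strong IV estimate: {strong_estimate:.2f}")
print(f"median abs. error, strong: {error_strong:.2f}, weak: {error_weak:.2f}")
```

With a strong instrument the estimate settles near the true effect, while with a weak instrument the denominator is close to zero relative to its sampling noise and the estimates become erratic in samples of moderate size.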
The impacts of weak IVs are not just limited to finite samples. Recall that we cannot confirm we have a true IV using observed data and so we must assume our IV estimate is inconsistent for β. In this case, as an IV becomes weaker, the sensitivity of the corresponding estimate β_IV to violations of the IV independence assumption increases [23]. To elucidate this, suppose we had two IV candidates Z_1 and Z_2 with corresponding strengths α_1 and α_2, where α_1 > α_2. For the same degree of violation in the independence assumptions (e.g. in Fig. 2, δ = c > 0 where c is some constant), the inconsistency of an estimate derived from Z_2 would be greater than that from using Z_1.
When there is treatment effect heterogeneity, even when we have a valid IV, β_IV can be inconsistent for β because β_IV is an estimand for a subset of the original population. Table 1 summarizes four distinct sub-populations related to the IVs: always-takers, compliers, defiers, and never-takers. We can use potential outcomes once again but for treatment assignment: let Z be a binary instrument, D(0) be the treatment assignment had the value of the IV been 0, and D(1) be the treatment assignment had the value of the IV been 1.
In Table 1, it is clear that changing values of the IVs results in changing values of the treatment assignment only for compliers and defiers. Therefore, we cannot identify always-takers and never-takers using IVs. In addition, we must impose a further assumption that the defier population does not exist for a given IV. This notion is called "monotonicity," where D(1) ≥ D(0) or vice-versa. Thus, we conclude that the subpopulation identified by the IVs is the compliers. To explain why we require monotonicity, we can rewrite the numerator of Eq. 1 as [24]
$$E[Y \mid Z = 1] - E[Y \mid Z = 0] = E[Y(1) - Y(0) \mid D(1) > D(0)]\,P(D(1) > D(0)) - E[Y(1) - Y(0) \mid D(1) < D(0)]\,P(D(1) < D(0))$$
Because of treatment effect heterogeneity, the treatment effect differs by subpopulation and there is the potential for the non-zero treatment effects in the complier and defier groups to cancel out. Of course, this does not occur if P(D(1) < D(0)) = 0 (no defiers) or if the treatment effect is the same for compliers and defiers (no treatment effect heterogeneity).
Through a similar derivation in the denominator of Eq. 1 as above, we arrive at a clearer definition of the LATE: β_IV = E[Y(1) − Y(0) | D(1) > D(0)], or the ATE for compliers [21]. This LATE changes with the chosen IV. If there are multiple IVs, then the LATE is a weighted average of the LATEs characterized by each IV. When we have covariates included to establish the validity of Z or decrease error in predicting D, then the LATE is an estimand defined on a population conditional on these covariates. Furthermore, unless the model is saturated, always-takers and never-takers are included [25,26]. As most models in practice include covariates, the interpretability of IV models can be nebulous.
In practice, there usually exists treatment effect heterogeneity, so before using IVs we must determine if it is reasonable to target the LATE as a proxy for the ATE estimand. Consider the following scenarios. First, when treatment effect heterogeneity is unrelated to the choice of treatment, the estimates for compliers will not be systematically different from those of the other subpopulations. Next, when the first stage is strong, the population characterized by the IVs will be relatively close to the overall population of the study. For example, if we have found a strong genetic determinant of some condition that was a valid IV (i.e. a Mendelian randomization strategy), it is plausible that, for the vast majority of the population, the occurrence of the condition would vary with the assignment of the gene. It follows that if the IV is weak, the LATE will only capture a small subset of the original population, introducing significant inconsistency in estimating the ATE. Lastly, there is a developing literature that relaxes the assumptions for the Wald estimand to equal the ATE, such as requiring heterogeneity in the outcome caused by the treatment assignment to be independent of heterogeneity in treatment assignment caused by the IV as well as of the IV itself [27][28][29].
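The interpretation of the LATE as the effect among compliers can be checked numerically. The sketch below is illustrative only: the shares of never-takers, compliers, and always-takers and their treatment effects are assumed values, and defiers are excluded by construction so that monotonicity holds.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400_000
z = rng.integers(0, 2, size=n)                        # binary instrument
# Assumed compliance types: 0 = never-taker, 1 = complier, 2 = always-taker.
kind = rng.choice([0, 1, 2], size=n, p=[0.3, 0.5, 0.2])
d = np.where(kind == 2, 1, np.where(kind == 1, z, 0))  # only compliers follow Z
# Assumed heterogeneous effects: compliers 1.0, always-takers 3.0, never-takers 0.0.
effect = np.select([kind == 1, kind == 2], [1.0, 3.0], default=0.0)
y = effect * d + rng.normal(size=n)

wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
ate = effect.mean()                          # population ATE, about 0.5*1.0 + 0.2*3.0
complier_effect = effect[kind == 1].mean()   # effect among compliers
print(f"Wald {wald:.2f}, complier effect {complier_effect:.2f}, ATE {ate:.2f}")
```

Under these assumptions the Wald ratio tracks the complier effect rather than the population ATE, and the gap between the two depends on how different the always-takers and never-takers are from the compliers.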
Identifying potential IVs is significantly less straightforward than identifying confounders, which is a main limitation of the approach. While there is usually abundant literature on predictors of a medical condition, the factors that determine the assignment of a treatment are difficult to study and are not usually studied. One reason for writing this paper is to expose biomedical researchers to IVs so that potential IVs can be shared in the literature in the same way that predictors are. In a similar vein to the confounder adjustment approach, we may begin by determining factors that predict treatment assignment and then prune those that affect the outcome. Determining confounders first is helpful, as variables that were once invalid IVs may become valid after holding certain confounders constant.
One popular source of IVs is variation in medical practice as it is well known that practice differs across physicians and regions across a wide variety of medical conditions [30][31][32]. If appropriate, we could use factors such as regional variation, facility prescribing patterns, attitudes to certain contraindications, physician preference, and calendar time as IVs [33,34]. For example, with access to the relevant data, physician preference can be quantified by tabulating the proportion of patients under each physician who were prescribed the treatment of interest. Following this, we can use these proportions to predict which treatment a new patient who sees any of these physicians will receive.
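The tabulation of physician preference described above can be sketched in a few lines. The physician identifiers and prescription indicators below are hypothetical toy values used only to show the mechanics.

```python
import numpy as np

# Hypothetical toy data: one entry per patient.
physician = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2])   # physician seen by each patient
treated = np.array([1, 1, 0, 0, 0, 1, 0, 1, 1])     # 1 if the treatment was prescribed

patients_per_physician = np.bincount(physician)
treated_per_physician = np.bincount(physician, weights=treated)
preference = treated_per_physician / patients_per_physician  # share treated per physician

# Candidate instrument: each patient inherits the preference of their physician.
patient_iv = preference[physician]
print(preference)
```

Each physician's proportion then serves as the candidate instrument value for the patients they see.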
Even still, the validity of prescriber preference as an IV can be questioned. It could be that certain types of patients tend to select a physician that they know is more likely to give them the treatment (graphically, Fig. 2 edges from Z to U). Furthermore, geographic variation in general population health could necessitate higher utilization of treatments in some regions compared to others. Herein lies the value of identifying confounders in IV analyses: perhaps controlling for patient characteristics will block these pathways and greatly reduce assumption violations (e.g. Fig. 4). One takeaway, however, is that IV analysis can easily suffer from issues related to unobserved confounding.
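How much a given assumption violation matters depends on the strength of the instrument, as noted earlier. The sketch below assumes a fully linear system with a continuous treatment and assumes that δ denotes a direct effect of Z on Y; in that setting the IV ratio converges to β + δ/α. The function name and all parameter values are illustrative assumptions.

```python
import numpy as np

def iv_estimate(alpha, delta, n=200_000, beta=2.0, seed=0):
    """Ratio Cov(Y, Z) / Cov(D, Z) when Z also affects Y directly with weight delta."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)
    u = rng.normal(size=n)                                    # unobserved confounder
    d = alpha * z + u + rng.normal(size=n)                    # first stage, strength alpha
    y = beta * d + delta * z + 1.5 * u + rng.normal(size=n)   # delta != 0 is a violation
    return np.cov(y, z)[0, 1] / np.cov(d, z)[0, 1]

strong = iv_estimate(alpha=1.0, delta=0.2)   # approximately 2.0 + 0.2 / 1.0
weak = iv_estimate(alpha=0.2, delta=0.2)     # approximately 2.0 + 0.2 / 0.2
print(f"same violation, strong IV: {strong:.2f}, weak IV: {weak:.2f}")
```

The same violation of size 0.2 shifts the estimate by roughly 0.2 with the stronger instrument but by roughly 1.0 with the weaker one.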
Given a set of IVs, we should characterize each subpopulation. For medical practice patterns, most likely, some patients would not comply with a doctor's opinions; some patients could insist on getting the treatment (always-takers) and others would refuse under all circumstances (never-takers). A patient who does the opposite of what the doctor says (a defier) is possible and we will have to assume that such patients do not exist, which is practically untestable but can be reasoned as unlikely. Under this assumption, the LATE would roughly be the treatment effect among those who follow the doctors' orders. All of this considered, the analyst should determine whether the complier treatment effect is of scientific value.

Interactions of the confounder and IV approaches
The confounder and IV approaches are deeply related. Therefore, even if an analyst decides to pursue one approach over another, awareness of the principles of the other approach is important. One pervasive issue in this vein is adjusting for an IV as if it were a confounder. Widely-cited guidelines such as Hirano and Imbens (2001) state that variables that are predictive of treatment assignment should be selected for confounder methods like propensity scores [35], which risks adjusting for IVs and mediators. In the best case, treating IVs as confounders decreases precision because they do not explain variation in the response. Even worse, when there is unobserved confounding, existing inconsistency is amplified [36][37][38][39]. By adjusting for IVs, we reduce variation in the treatment that is uncorrelated with the unobserved confounding. Thus, the share of variation in the treatment produced by unobserved confounding increases, which causes more bias in the treatment effect.
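The amplification of existing bias when an IV is adjusted for can be illustrated with an assumed linear system. In the sketch below, the helper `ols` and all coefficient values are hypothetical; under these values the omitted-confounder bias is 1.5/(4 + 1 + 1) = 0.25 without adjusting for Z and 1.5/(1 + 1) = 0.75 after adjusting for Z.

```python
import numpy as np

def ols(columns, y):
    """Least-squares coefficients for y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(3)
n = 200_000
z = rng.normal(size=n)                        # instrument, affects D only
u = rng.normal(size=n)                        # unobserved confounder
d = 2.0 * z + 1.0 * u + rng.normal(size=n)    # treatment
y = 2.0 * d + 1.5 * u + rng.normal(size=n)    # outcome, true effect 2.0

without_z = ols([d], y)[1]     # approximately 2.25
with_z = ols([d, z], y)[1]     # approximately 2.75, the bias is amplified
print(f"not adjusting for Z: {without_z:.2f}, adjusting for Z: {with_z:.2f}")
```

Adjusting for Z removes the confounder-free variation in D, so the remaining variation is more heavily driven by U and the estimate moves further from the true effect.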
The impact of adjusting for IVs and mediators demonstrates why one should avoid a purely "kitchen sink," data-driven approach to variable selection for causal inference. Simply because the estimate of the treatment effect changes when a variable is introduced does not necessarily mean it should be adjusted for. This is one reason why we advocate that confounders largely be sourced a priori by first hypothesizing predictors of the outcome. If one is reasonably certain that a variable is predictive of the outcome but is unlikely to be associated with the predictor of interest, one has a "precision variable," which still may be of use. Specifically, in the linear model setting with no unmeasured confounding, adjusting for such a variable will decrease standard errors in the treatment effect estimate with no cost to bias [18,40].
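The benefit of a precision variable in the linear setting can be shown with classical OLS standard errors. This is a rough sketch under assumed values: `ols_with_se`, the randomized treatment, and the coefficients are illustrative, and the standard errors use the usual homoskedastic formula.

```python
import numpy as np

def ols_with_se(columns, y):
    """OLS coefficients and classical (homoskedastic) standard errors."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return coef, se

rng = np.random.default_rng(4)
n = 5_000
d = rng.integers(0, 2, size=n).astype(float)   # treatment, unrelated to p
p = rng.normal(size=n)                         # precision variable: predicts Y only
y = 2.0 * d + 2.0 * p + rng.normal(size=n)

coef_without, se_without = ols_with_se([d], y)
coef_with, se_with = ols_with_se([d, p], y)
print(f"without p: {coef_without[1]:.2f} (SE {se_without[1]:.3f})")
print(f"with p:    {coef_with[1]:.2f} (SE {se_with[1]:.3f})")
```

Both fits target the same coefficient, but including the precision variable removes outcome variance and shrinks the standard error of the treatment effect.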

Methodology for estimating causal effects
In this section, we discuss three popular approaches for estimating causal effects: regression adjustment using ordinary least squares (OLS) or generalized linear models, propensity score weighting with inverse probability of treatment weighting (IPTW), and utilizing two-stage least squares (2SLS) with IVs. Regression and IPTW are used in the confounder approach while 2SLS serves as a methodology for the IV approach. Although there are many other methods such as g-computation, targeted learning, proximal causal learning, non-parametric 2SLS, and two-stage residual inclusion, we will spend most of our discussion on regression, IPTW, and 2SLS because they are most commonly used and well-studied. A summary of the points discussed in this section is presented in Table 2. We briefly speak about recent advances in later sections.
A more sophisticated version of Fig. 1 is presented in Fig. 5. We have added a vector of IVs of length j (Z_1, Z_2, ..., Z_j) and a vector of observed confounders of length k (X_1, X_2, ..., X_k). In addition, we have stochastic errors τ and ε for D and Y, respectively. For simplicity, assume the effect of each IV is the same magnitude, and similarly for each confounder, and that the outgoing arrows capture the joint effect. Furthermore, U captures all unobserved confounding though, in reality, there are likely many variables. Assuming the relationships between variables are linear, we can write the following system of relevant structural equations:
$$D = \alpha_0 + \alpha_Z \sum_{m=1}^{j} Z_m + \alpha_X \sum_{l=1}^{k} X_l + \alpha_U U + \tau \tag{2}$$
$$Y = \beta_0 + \beta_D D + \beta_X \sum_{l=1}^{k} X_l + \beta_U U + \varepsilon \tag{3}$$
Equation 2 depicts the treatment assignment or "first stage" while Eq. 3 depicts the outcome or "second stage." The estimand of interest is β_D. For ease of exposition, we will assume a linear probability model (LPM) as in Eq. 2. Nevertheless, because the treatments are usually binary variables, the functional form of the treatment assignment is commonly characterized using a logit or probit model. Please see later sections for discussion on the use of LPMs for modeling binary treatment assignment.
Using the above two equations, we can provide a high-level overview of the three methods in practice. Regression methods fit Eq. 3 with the treatment and the observed confounders to estimate β_D. U is not observed, so the estimate is inconsistent due to misspecification. Meanwhile, IPTW first fits Eq. 2 with only the confounders to predict the propensity score for all subjects. The propensity scores are then used to compute a weighted sum that allows us to estimate β_D. Because U is missing, the predicted propensity scores will be neither correct nor adequate to achieve ignorability conditional on the propensity score.
2SLS will fit Eq. 2 as stated (except for U) and use the predictions to construct D̂. We then effectively substitute D̂ for D in Eq. 3 and fit an OLS model to estimate the coefficient in front of D̂. Importantly, the omission of U does not affect the consistency of this estimate under the conditions of Fig. 5, and if there is no treatment effect heterogeneity then we have a consistent estimate of β_D. If there is heterogeneity then, at the least, the estimate is not affected by U.
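The two stages described above can be sketched by hand with least squares. This is a simplified illustration, not the implementation used in this paper: it assumes a single instrument, a single observed confounder, a continuous treatment, and made-up coefficients, and the function `two_stage_least_squares` is hypothetical. Standard errors taken naively from the second-stage fit would not be valid, so only point estimates are shown.

```python
import numpy as np

def two_stage_least_squares(y, d, z, x):
    """Point estimate of the treatment coefficient from manual two-stage least squares."""
    ones = np.ones(len(y))
    first_stage = np.column_stack([ones, z, x])            # instrument plus confounder
    d_hat = first_stage @ np.linalg.lstsq(first_stage, d, rcond=None)[0]
    second_stage = np.column_stack([ones, d_hat, x])       # replace D by its prediction
    return np.linalg.lstsq(second_stage, y, rcond=None)[0][1]

rng = np.random.default_rng(6)
n = 200_000
z, x, u = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
d = 1.0 * z + 0.5 * x + 1.0 * u + rng.normal(size=n)   # first stage, U unobserved
y = 2.0 * d + 1.0 * x + 1.5 * u + rng.normal(size=n)   # second stage, true effect 2.0

tsls = two_stage_least_squares(y, d, z, x)
ols_design = np.column_stack([np.ones(n), d, x])        # regression adjustment without U
ols_est = np.linalg.lstsq(ols_design, y, rcond=None)[0][1]
print(f"OLS adjusting for X only: {ols_est:.2f}, 2SLS: {tsls:.2f}")
```

Under these assumptions, regression adjustment that omits U is pulled away from the true effect, while the 2SLS point estimate stays close to it because the predicted treatment varies only with Z and X.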

Regression adjustment and propensity score methods
While OLS and IPTW are different mathematically, they are conceptually similar: both seek to isolate variation in the outcome caused by the treatment by eliminating variation caused by confounding factors. Regression adjustment can be thought of as blocking paths in Fig. 5, which is another way of stating that we are holding X constant in order to isolate the effect of D on Y and obtain the direct causal pathway with the following estimand:
$$E\big[\,E[Y \mid D = 1, X] - E[Y \mid D = 0, X]\,\big]$$
On the other hand, IPTW weights outcomes based on the probability of receiving treatment, p(X) = P(D = 1 | X), creating a pseudo-population that balances confounders across the treatment groups, following a rationale similar to randomization. IPTW has the following estimand:
$$E\left[\frac{D\,Y}{p(X)} - \frac{(1 - D)\,Y}{1 - p(X)}\right] \tag{4}$$
For a more concrete example of how a pseudo-population is constructed, suppose an individual in the treatment group had a propensity score of 0.1. In other words, this individual was very likely to receive the control and has many similar subjects in the control group. The weighted outcome of this individual can represent the counterfactual outcomes of the comparable control group subjects and so we would weight that individual's contribution to the treatment effect ten times. As a consequence of such a procedure, in Fig. 5 the pseudo-population will theoretically no longer contain edges for the X's because the treatment assignment cannot be explained by covariate imbalance. A balance of observed confounders, however, does not imply a balance of unobserved confounders: the act of balancing observed confounders may increase the imbalance of these unobserved confounders [41].
Propensity scores are often used to match individuals across treatment groups. Once one or more suitable matches are found for each subject in the treatment group, we can compute the differences in outcomes and average them to obtain the average treatment effect on the treated (ATT), E[Y(1) − Y(0) | D = 1].
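The inverse probability weighting described above can be sketched with a single binary confounder, for which the propensity score can be estimated by the treated share within each level of X. All names and values below are assumptions for illustration, and there is no unobserved confounding in this toy setup.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
x = rng.integers(0, 2, size=n)                  # single binary observed confounder
p_true = np.where(x == 1, 0.8, 0.2)             # assumed true propensity P(D = 1 | X)
d = (rng.random(n) < p_true).astype(float)
y = 2.0 * d + 3.0 * x + rng.normal(size=n)      # true treatment effect 2.0

# Saturated propensity model: treated share within each level of x.
p_hat = np.array([d[x == 0].mean(), d[x == 1].mean()])[x]

# Weighted contrast: treated outcomes weighted by 1/p, controls by 1/(1 - p).
iptw = np.mean(d * y / p_hat - (1 - d) * y / (1 - p_hat))
naive = y[d == 1].mean() - y[d == 0].mean()
print(f"naive contrast: {naive:.2f}, IPTW: {iptw:.2f}")
```

The naive contrast mixes the effect of treatment with the imbalance in X across groups, while the weighted contrast rebalances X and lands near the true effect in this setting.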
We focus on IPTW and not propensity score matching because, under unobserved confounding, the ATT will not in general equal the ATE: E[Y(1) − Y(0) | D = 1] ≠ E[Y(1) − Y(0)].
While our discussion of methodology in this paper mainly centers around the ignorability assumption, it is important to briefly touch upon the implications of positivity assumption violations. In propensity score methods, positivity means each subject has a positive probability of receiving each treatment given each level of the covariates, or 0 < P(D = d | X = x) < 1 for all d ∈ D and x ∈ X. Another way to conceptualize the positivity assumption is the notion of "common support," where there must be full overlap in each group's distribution of propensity scores or, by extension, their observed covariates. Even when there is no unobserved confounding, positivity violations can arise when we fail to observe certain variables that are needed to create overlap. Therefore, for the subpopulations lacking overlap, extrapolating counterfactual claims can lead to erroneous conclusions.
Conceptually, a violation of the positivity assumption means that we are dividing by 0 in Eq. 4. In practice, this results in extreme weights, which both increase variability in the parameter estimates and introduce finite sample bias because the estimate is weighted towards the few extreme observations [42]. In addition, near-positivity violations, or individuals who are extremely unlikely to receive the treatment or placebo, pose similar issues for estimation. With OLS, positivity violations bias the estimand much as they do with IPTW. Nevertheless, OLS does not face the same degree of finite sample estimation issues as IPTW when there are positivity or near-positivity violations.
While IPTW and OLS target the same estimand, the ATE, and both are vulnerable to inconsistency via unobserved confounding, there are differences to consider in practice. If there are no extreme weights, by encapsulating many covariates in the propensity score, IPTW will generally be more efficient than OLS because of the degrees of freedom saved. If there are extreme weights, however, IPTW estimates will be less stable than those from regression adjustment.
One common solution to extreme weights in propensity score methods is to trim (truncate) them. This procedure, however, risks estimating the treatment effect for a population different from the original target population. In other words, for a decrease in variance, there is a potential increase in bias. Furthermore, the direction of this bias is difficult to determine because one must define a new population resulting from truncation. One could argue, though, that the bias due to positivity violations could be advantageously traded off against the bias due to truncation [43]. Another common solution to extreme weights is to use stabilized weights as opposed to conventional inverse probability weights [44].
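The two remedies mentioned above can be sketched as follows. The function name `ipt_weights`, the truncation bounds, and the toy inputs are hypothetical; truncation is implemented here by clipping the propensity scores, and stabilization by multiplying each weight by the marginal probability of the treatment level actually received, which is one common convention rather than the only one.

```python
import numpy as np

def ipt_weights(d, p, stabilized=False, clip=None):
    """Inverse probability of treatment weights.

    d: binary treatment (0/1); p: propensity scores P(D=1|X).
    clip: optional (low, high) bounds applied to p before weighting (truncation).
    stabilized: if True, multiply by the marginal probability of the received arm.
    """
    d = np.asarray(d, dtype=float)
    p = np.asarray(p, dtype=float)
    if clip is not None:
        p = np.clip(p, clip[0], clip[1])   # bounds the largest possible weight
    w = d / p + (1 - d) / (1 - p)
    if stabilized:
        p_d = d.mean()                      # marginal P(D = 1) in the sample
        w = w * (d * p_d + (1 - d) * (1 - p_d))
    return w

# Toy inputs (hypothetical): two units sit near a positivity violation.
d_demo = [1, 0, 1, 0]
p_demo = [0.01, 0.5, 0.5, 0.99]
w_raw = ipt_weights(d_demo, p_demo)                      # weights of 100 appear
w_trunc = ipt_weights(d_demo, p_demo, clip=(0.05, 0.95))  # capped at 20
w_stab = ipt_weights(d_demo, p_demo, stabilized=True)     # rescaled by P(D = d)
```

Truncation changes which population the weighted sample represents, which is the bias-variance trade-off discussed above; stabilization only rescales the weights within each arm.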
OLS and IPTW also have potential differences in ease of reproducibility and interpretability. Because propensity scores are the result of fitting a model for treatment assignment, methods ranging from logistic regression to random forests may be used. An issue, however, arises when different models produce different sets of propensity scores, resulting in different pseudo-populations. This poses challenges for reproducibility across different studies of the same population. The simplicity of OLS arguably reduces this risk since basic adjustment can be easily communicated. On the other hand, when unaccounted-for treatment effect heterogeneity exists, OLS will generate a marginal treatment effect estimate that is implicitly weighted by the covariance structure in the observed data sample, as opposed to the explicit weighting in IPTW [20,24].
One advantage of the propensity score is that data-driven selection of the confounders to model treatment assignment is done separately from the fitting of Eq. 3. In contrast, adding and removing confounders in OLS in a data-driven fashion will also affect the estimand, estimate, and corresponding inference for the treatment effect. Therefore, IPTW is able to control inflation in Type I error from repeated testing of the treatment effect coefficient as a result of fitting several models.
Though it may be tempting to cast estimating propensity scores as a prediction problem, this may lead to unintended consequences. The original philosophy of propensity scores from Rosenbaum and Rubin is not to fit the first stage as well as possible; rather, it is to find a balancing score sufficient to achieve ignorability [45][46][47][48]. Furthermore, measures of model performance like the C-statistic do not provide useful information to suggest unobserved confounding is mitigated more in one model than another [49]. IVs are predictors of treatment assignment, yet they are detrimental to causal estimation if included in the model [36,46]. In addition, including variables that are predictive of the outcome but not the treatment can help improve efficiency of treatment effect point estimates [40]. Therefore, we suggest that variable selection for propensity score modeling not be conducted purely by optimizing out-of-sample model prediction error.
In contrast, the simplicity of using one equation in regression methodology assists in avoiding confusion between prediction and inference goals because there is no intermediate step in obtaining an estimate for β D . Indeed, an optimized model MSE may reduce coefficient standard errors and yet could lead to issues in internal validity. For example, if we fit Eq. 3 with LASSO to select confounders that optimized out-of-sample prediction error, the elimination of confounders via shrinkage to zero could potentially induce further omitted variable bias due to no longer conditioning on an observed confounder [50].
A combination of the IPTW and regression methodology is the augmented IPTW (AIPTW) with the so-called "doubly robust" property: a consistent estimate is obtained if either the propensity score (Eq. 2) or the outcome equation (Eq. 3) is correctly specified. Certainly, AIPTW offers more robustness than IPTW but its practical advantage is unclear. Firstly, in the case where one suspects the propensity score equation is misspecified but the outcome equation is not, OLS also will result in a consistent estimate and could theoretically be more efficient. Secondly, an unobserved confounder would cause misspecification in both equations, rendering any estimate inconsistent. As such, it is unclear how the inconsistency due to unobserved confounding in the AIPTW compares to that of IPTW or OLS.
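The double robustness property can be checked numerically with the standard AIPTW form, sketched below. The function name `aiptw_ate`, the simulated data, and the deliberately wrong nuisance inputs (a constant propensity of 0.5 and outcome predictions of zero) are hypothetical illustration choices; the nuisance quantities are plugged in directly rather than estimated.

```python
import numpy as np

def aiptw_ate(y, d, p_hat, m1_hat, m0_hat):
    """Augmented IPTW estimate of the ATE.

    p_hat: propensity scores; m1_hat, m0_hat: predicted outcomes under
    treatment and control from an outcome model.
    """
    y, d, p_hat, m1_hat, m0_hat = (
        np.asarray(a, dtype=float) for a in (y, d, p_hat, m1_hat, m0_hat)
    )
    treated = m1_hat + d * (y - m1_hat) / p_hat
    control = m0_hat + (1 - d) * (y - m0_hat) / (1 - p_hat)
    return np.mean(treated - control)

# Hypothetical data: one observed binary confounder, true ATE = 2.
rng = np.random.default_rng(1)
n = 200_000
x = rng.binomial(1, 0.5, n)
p_true = np.where(x == 1, 0.8, 0.1)
d = rng.binomial(1, p_true)
y = 2.0 * d + 3.0 * x + rng.normal(size=n)

zeros = np.zeros(n)        # a badly misspecified outcome model
half = np.full(n, 0.5)     # a badly misspecified propensity model
only_ps_right = aiptw_ate(y, d, p_true, zeros, zeros)
only_outcome_right = aiptw_ate(y, d, half, 2.0 + 3.0 * x, 3.0 * x)
both_wrong = aiptw_ate(y, d, half, zeros, zeros)
```

With either nuisance model correct the estimate should sit near 2, while with both wrong it does not. An unobserved confounder would correspond to the last case, since it would be missing from both models.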
This scenario is also called "endogeneity," that is, correlation of D with the error term. This is because $\phi_i = \beta U + \epsilon_i$, which is correlated with D, so that $E[\phi_i \mid D, X] \neq 0$. IV methods will identify a re-characterized treatment, $D_{IV}$, using exogenous variation (uncorrelated with the error term) such that $\mathrm{Cov}(D_{IV}, \phi_i) = 0$.
In the first stage of 2SLS, we regress D on the X's and Z's (design matrix $F$) and use the projection matrix $P_Z = F(F^T F)^{-1}F^T$ to obtain predicted values $D_{IV}$. In the second stage, we use $P_Z$ to obtain the coefficients $\beta = (S^T P_Z S)^{-1} S^T P_Z Y$ with design matrix $S$ consisting of the X's and D. By including the X's in both the first and second stages, we can improve the prediction of D, and the covariates serve as their own IVs. If all assumptions are met, and there is no treatment effect heterogeneity, then $\beta^{IV}_D$ is consistent for $\beta_D$ but not necessarily unbiased. In fact, in the case where we only have one IV for one endogenous variable, the first moment of the estimator does not exist [51].
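The projection form of 2SLS above translates almost directly into code. The sketch below is illustrative only: the function name `two_sls` and the simulated data (one covariate, one valid IV, one unobserved confounder `u`) are hypothetical, and an intercept column is added to both design matrices as an assumption. The n-by-n projection matrix is never formed; the fitted values of S on F are used instead, which gives the same coefficients.

```python
import numpy as np

def two_sls(y, d, X, Z):
    """2SLS coefficients (S' P_Z S)^{-1} S' P_Z y; the last entry is for D."""
    n = len(y)
    ones = np.ones((n, 1))
    F = np.column_stack([ones, X, Z])   # first-stage design: intercept, X's, Z's
    S = np.column_stack([ones, X, d])   # second-stage design: intercept, X's, D
    # P_Z S equals the fitted values from regressing each column of S on F.
    S_hat = F @ np.linalg.lstsq(F, S, rcond=None)[0]
    return np.linalg.solve(S_hat.T @ S, S_hat.T @ y)

# Hypothetical data: u is unobserved and confounds D and Y; true effect = 1.5.
rng = np.random.default_rng(2)
n = 50_000
x = rng.normal(size=n)
z = rng.normal(size=n)
u = rng.normal(size=n)
d = 0.5 * x + 1.0 * z + u + rng.normal(size=n)
y = 1.5 * d + 1.0 * x + 2.0 * u + rng.normal(size=n)

beta_2sls = two_sls(y, d, x, z)[-1]
S_ols = np.column_stack([np.ones(n), x, d])
beta_ols = np.linalg.lstsq(S_ols, y, rcond=None)[0][-1]  # omits u, so it is biased
```

In this simulation the OLS coefficient on D absorbs part of the effect of `u`, whereas the 2SLS coefficient should be close to 1.5 because the IV is strong and valid by construction.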
There are considerable trade-offs in the IV analysis: consistency comes at the cost of increased standard errors compared to OLS since we only use the exogenous variation in the treatment and, thus, have less "information" to calculate the treatment effect [23]. In the case where the IVs are weak, the finite sample bias of 2SLS will move towards that of OLS as weakness increases (i.e. first-stage coefficients go towards 0), and estimator standard errors will be inflated [22,52]. This is because our treatment effect is determined only by compliers, a subset of the overall population. If the IVs are correlated with the second-stage error term (i.e. U), not only will the estimates be inconsistent but the magnitude of inconsistency will be greatly affected by the IV strength [23]. This can be observed by deriving the form of the 2SLS estimand under this condition.
Cov(Z, ε) captures the degree of the violation in the independence of the IV, and Cov(Z, D) captures the first-stage strength. Rewriting covariances as correlations in Eq. 6 and in the OLS estimate α_D, we obtain: It is clear that when the IVs are weak, 2SLS inconsistency can be greater than even that of OLS if |Corr(D, φ)||Corr(D, Z)| < |Corr(Z, φ)|. Considering that we can never confirm the IV independence assumption, the use of weak IVs may be perilous.
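The inequality above can be recovered with a short derivation. The following is a sketch for the simplest case only, assuming a single IV, no covariates, and the structural equation $Y = \beta_D D + \phi$; the symbols $\sigma_\phi$ and $\sigma_D$ (standard deviations of the error and the treatment) are introduced here for illustration and the display is not claimed to match the numbered equations of the original.

```latex
\begin{align*}
\operatorname{plim}\hat\beta^{IV}_D
  &= \frac{\mathrm{Cov}(Z, Y)}{\mathrm{Cov}(Z, D)}
   = \beta_D + \frac{\mathrm{Cov}(Z, \phi)}{\mathrm{Cov}(Z, D)}
   = \beta_D + \frac{\mathrm{Corr}(Z, \phi)}{\mathrm{Corr}(Z, D)}\,\frac{\sigma_\phi}{\sigma_D}, \\
\operatorname{plim}\hat\alpha_D
  &= \frac{\mathrm{Cov}(D, Y)}{\mathrm{Var}(D)}
   = \beta_D + \frac{\mathrm{Cov}(D, \phi)}{\mathrm{Var}(D)}
   = \beta_D + \mathrm{Corr}(D, \phi)\,\frac{\sigma_\phi}{\sigma_D}, \\
\frac{\lvert \operatorname{plim}\hat\beta^{IV}_D - \beta_D \rvert}
     {\lvert \operatorname{plim}\hat\alpha_D - \beta_D \rvert}
  &= \frac{\lvert \mathrm{Corr}(Z, \phi) \rvert}
          {\lvert \mathrm{Corr}(D, \phi) \rvert\,\lvert \mathrm{Corr}(D, Z) \rvert}.
\end{align*}
```

The ratio exceeds 1 exactly when $|\mathrm{Corr}(D, \phi)||\mathrm{Corr}(D, Z)| < |\mathrm{Corr}(Z, \phi)|$. For instance, with $\mathrm{Corr}(D, Z) = 0.1$ and $\mathrm{Corr}(D, \phi) = 0.3$, an IV violation as small as $\mathrm{Corr}(Z, \phi) = 0.04$ already makes 2SLS more inconsistent than OLS, since $0.3 \times 0.1 = 0.03 < 0.04$.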
Resuming our assumption of valid IVs, another perhaps helpful perspective is that the first-stage is chiefly a prediction task. In this interpretation, weak IVs lead to inaccurate first-stage predictions, which leads to finite sample bias because we are unable to adequately capture the treatment assignment of the original population of interest. Simply adding more weak IVs to the first-stage rarely improves the issue; indeed, packing the first-stage with too many instruments will lead to overfitting and, hence, finite sample bias [53]. These issues are partially mitigated by large sample sizes.
Unlike many of the assumptions discussed, the degree of instrument relevance is somewhat determinable using the data. Because 2SLS utilizes OLS for the first stage, the F-statistic is commonly used to measure the joint strength of the IVs, with 10 being a "rule of thumb" for sufficient strength. In the case of heteroskedasticity, a robust F-statistic can be used but variance estimates may be noisy [54]. Nevertheless, using a sample statistic to draw inferences about assumptions could be problematic. For instance, Young 2017 points out that there is a relatively high chance of spuriously obtaining a high F-statistic and that the "guaranteed bounds" on size and bias offered by weak IV tests are unreliable. In addition, weak IV tests assume that the IVs are valid in the first place or else coverage will be incorrect [55].
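A first-stage F-statistic for the joint strength of the IVs can be computed by comparing the first stage with and without the instruments. The sketch below uses the classical (homoskedastic) formula, not a robust variant; the function name `first_stage_f` and the simulated strong and weak instruments are hypothetical.

```python
import numpy as np

def first_stage_f(d, X, Z):
    """Classical F-statistic for excluding the IVs Z from the first stage."""
    n = len(d)
    ones = np.ones((n, 1))
    restricted = np.column_stack([ones, X])       # covariates only
    unrestricted = np.column_stack([ones, X, Z])  # covariates plus IVs

    def rss(A):
        resid = d - A @ np.linalg.lstsq(A, d, rcond=None)[0]
        return resid @ resid

    q = unrestricted.shape[1] - restricted.shape[1]  # number of IVs tested
    df = n - unrestricted.shape[1]                   # residual degrees of freedom
    return ((rss(restricted) - rss(unrestricted)) / q) / (rss(unrestricted) / df)

# Hypothetical first stages: one strong IV and one nearly irrelevant IV.
rng = np.random.default_rng(3)
n = 2_000
x = rng.normal(size=n)
z_strong = rng.normal(size=n)
z_weak = rng.normal(size=n)
e = rng.normal(size=n)
f_strong = first_stage_f(0.5 * x + 1.0 * z_strong + e, x, z_strong)
f_weak = first_stage_f(0.5 * x + 0.02 * z_weak + e, x, z_weak)
```

As the paragraph above cautions, a large value in one sample is evidence of relevance only; it says nothing about whether the instrument is valid.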
Data-driven fitting of the first-stage may prompt concerns about external validity. In theory, we could select variables in the data that optimized a cross-validated F-statistic. In this process, however, we would fail to address the impacts on the interpretability of the LATE. This scenario introduces difficult situations where one set of IVs could have a worse F-statistic but is more interpretable for the desired use case. This is why we advocate for first thinking through the conceptual soundness of the selection of IVs given a set of theoretically strong predictors of the treatment.
Although the independence of the IVs is untestable, there are a few interesting falsification tests that one could employ. One straightforward test is to compare the values of potential confounders across values of the IVs similar to how one examines values of potential confounders across levels of the treatment [9]. Imbalance of IVs across an observed confounder is problematic because the observed confounders could be related to an unobserved confounder. Another test that assumes observed confounders may be related to unobserved confounders involves negative control outcomes, or populations constructed to falsify IV independence [56]. Of course, these tests cannot ensure IV assumptions are met and are subject to data availability.
In the multiple IV case, given there is one valid IV, one may test if additional IVs influence the outcome through Sargan's J-statistic [57]. The test regresses the 2SLS residuals on the IVs by OLS; if one or more coefficients are not 0, we have some sort of violation. This test may have deceptively high p-values and poor power when one IV is weak but valid and the others are strong but invalid [58]. The scenario of a mix of strength and validity of IVs poses an interesting question about whether one would choose a strong but slightly invalid IV over a weak but valid IV.
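One common way to compute the overidentification statistic is as n times the R-squared from regressing the 2SLS residuals on the full instrument set, compared against a chi-squared distribution with degrees of freedom equal to the number of IVs minus the number of endogenous variables. The sketch below follows that form under homoskedasticity; the function name `sargan_statistic` and the simulated data (two IVs, one of which is made invalid by adding it directly to the outcome) are hypothetical.

```python
import numpy as np

def sargan_statistic(y, d, X, Z):
    """Sargan statistic n * R^2 and its degrees of freedom (Z is n-by-m, m > 1)."""
    n = len(y)
    ones = np.ones((n, 1))
    F = np.column_stack([ones, X, Z])   # all exogenous variables and IVs
    S = np.column_stack([ones, X, d])   # second-stage regressors
    S_hat = F @ np.linalg.lstsq(F, S, rcond=None)[0]
    beta = np.linalg.solve(S_hat.T @ S, S_hat.T @ y)
    resid = y - S @ beta                # 2SLS residuals use the observed D
    fitted = F @ np.linalg.lstsq(F, resid, rcond=None)[0]
    r2 = 1.0 - np.sum((resid - fitted) ** 2) / np.sum((resid - resid.mean()) ** 2)
    return n * r2, Z.shape[1] - 1       # one endogenous variable assumed

# Hypothetical data: both IVs are relevant; the second is optionally invalid.
rng = np.random.default_rng(4)
n = 5_000
x = rng.normal(size=n)
Z = rng.normal(size=(n, 2))
u = rng.normal(size=n)
d = 0.5 * x + Z[:, 0] + Z[:, 1] + u + rng.normal(size=n)
y_valid = 1.5 * d + x + 2.0 * u + rng.normal(size=n)
y_invalid = y_valid + 1.0 * Z[:, 1]     # direct effect of the second IV on Y

stat_valid, df = sargan_statistic(y_valid, d, x, Z)
stat_invalid, _ = sargan_statistic(y_invalid, d, x, Z)
critical_value = 3.841                  # 95th percentile of chi-squared with 1 df
```

Here the statistic for the invalid case should far exceed the critical value. The test only detects disagreement among the IVs, so it would have no power if both IVs were invalid in the same direction.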

Complexities that arise with non-linearity
So far, our discussion has been focused on the case where the outcome is continuous and, thus, we can reasonably assume linear structural equations. When the outcome is not continuous, some of the statements we have previously made must be modified. In particular, we will revisit the consequences of adjusting for confounders, IVs, and precision variables, marginal versus conditional estimands, and the use of 2SLS for binary treatments and non-continuous outcomes.
First, in the non-linear model setting the estimand corresponding to the treatment effect will change by including not only confounders but also precision variables because of non-collapsibility [18]. Mathematically, because the covariates are encapsulated in a non-linear function (e.g. a link function), after adjusting for a precision variable, we cannot simply distribute the expected value such that we recover the "before adjustment" treatment effect [59]. A further consequence is that adjusting for any variable will increase coefficient standard errors of the variables already present. For instance, in logistic regression, the unexplained variance must stay fixed so the explained variance will increase upon adjustment for new variables, leading to coefficient values increasing in magnitude [60,61]. Adjusting for precision variables will increase standard errors of the treatment effect but slightly increase the power to reject the null of no effect; this is due to the magnitude of the adjusted point estimate increasing relative to slight increases in the standard error of the estimate [62].
The presence of non-collapsibility means that the interpretation of coefficients before and after adjusting for variables differs even in cases where there is independence between the adjustment variable and the treatment variable. Marginal estimates, without confounders, will therefore be different from conditional (on the confounders and precision variables) estimates. Comparing methodologies, IPTW will give a marginal estimate whereas regression will give a conditional estimate [61]. For example, after adjusting for the confounders, IPTW produces the odds ratio for the population characterized by the sample while a logistic regression model would compute the odds ratio for someone with the average value of the confounders [63]. In the linear case, the difference between conditional and marginal effects was only present under treatment effect heterogeneity.
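Non-collapsibility can be seen without fitting any model by computing both odds ratios directly from a known logistic data-generating process. The coefficients below are hypothetical, and the precision variable W is binary with probability 0.5 and independent of the treatment, so there is no confounding in this example.

```python
import math

def expit(t):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-t))

def odds(p):
    return p / (1.0 - p)

# Hypothetical model: logit P(Y=1 | D, W) = b0 + b_d * D + b_w * W.
b0, b_d, b_w = -1.0, 1.0, 2.0

# Conditional odds ratio for D, holding W fixed (the same for W = 0 and W = 1).
conditional_or = math.exp(b_d)

def marginal_risk(d):
    """P(Y=1 | D=d) after averaging over W ~ Bernoulli(0.5), independent of D."""
    return 0.5 * expit(b0 + b_d * d) + 0.5 * expit(b0 + b_d * d + b_w)

# Marginal odds ratio for D, ignoring W.
marginal_or = odds(marginal_risk(1)) / odds(marginal_risk(0))
```

With these values the marginal odds ratio is about 2.23 while the conditional odds ratio is about 2.72, even though W is unrelated to D. The gap is a property of the odds ratio, not a sign of bias in either quantity.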
Whether the marginal or conditional estimand is preferable depends on the scenario. One could argue that the conditional estimand is more applicable to settings where a physician is already conditioning upon knowledge of various patient characteristics like sex, age, and comorbidities. In addition, conditional estimands may be better transported to other populations such as future populations. On the other hand, marginal estimands can be more interpretable and comparable across studies [64]. They may also be preferred when the covariates potentially included in the model are not easily observed or measured in practice.
The issue with using IVs as confounders persists in the non-linear setting: adjusting for IVs as if they were confounders amplifies existing bias and could even introduce bias where none previously existed [38,39]. Amplification occurs for the same reasons as in the linear setting, while introducing bias occurs when the IVs are dependent on the outcome given the treatment [65]. Ding et al., however, do find specific situations where this bias was not present under their proposed monotonicity conditions for the treatment selection and outcome model.
A primary challenge for standard IV methods is that 2SLS misspecifies the non-linear functional form that arises when we have a binary treatment. Nevertheless, analysts may still utilize 2SLS in non-linear settings, such as through linear probability models (LPMs). Simulations have shown that LPMs can produce low inconsistency in the estimates of the LATE [66]. A counterpart of 2SLS, two-stage residual inclusion (2SRI), where one includes the residuals from the first stage as a covariate in the second stage, did not perform nearly as well. Furthermore, claims that non-linear 2SRI (using a probit model, for example) is able to recover the ATE (as opposed to the LATE) are questionable [67]. Quantifying the effect of IV method misspecification relative to the confounder methods is an avenue of research.
Intuitively, using LPMs appears inappropriate as there is nothing preventing prediction of values that are outside of the interval [0, 1], which leads to bias and inconsistency of LPM estimates. However, if the population of interest only has probabilities within a range not close to 0 or 1, then this will seldom be an issue because we will not have errant predicted values. Thus, LPMs will be consistent and unbiased [68]. One could also argue that if there are true probabilities that are 0 or 1, then we have a positivity violation and propensity score methods, whether they use a logit, probit, or LPM to calculate propensity scores, will also run into issues. One solution for any method is to truncate probabilities, but this will be at the cost of bias for the original causal effect.
We can rarely mimic the 2SLS substitution procedure with non-linear models, such as two-stage probit, and maintain consistency for the treatment effect (such an action is called "forbidden regression") [23]. Instead, one can consider a three-step procedure: fit the first stage with a non-linear model; regress the treatment on the resulting predicted values by OLS, excluding the IVs; and, lastly, fit the second stage with a linear or non-linear model.

Flexible confounder and IV methodology
Recent developments in causal inference methodology have been mostly focused on relaxing assumptions on common approaches like OLS, IPTW, and 2SLS. These developments are predominantly motivated by the wish to reduce the impacts of model misspecification and the view that many subtasks of estimation are primarily prediction-based, allowing for more flexible modeling using ML. Another motivation that we will not discuss is sparsity (e.g. incorporating more confounders or IVs than observations). We will cover three approaches: targeted minimum loss-based estimation (TMLE), Post-LASSO (PL), and double machine learning (DML) [10, 12, 69].
TMLE improves upon AIPTW by modeling nuisance quantities flexibly in addition to achieving double robustness. Nuisance quantities are needed to calculate the treatment effect but are not of direct interest to the research question. For example, the propensity score can be considered a nuisance quantity. By using ML and a "targeting step," we can avoid nuisance quantity misspecification and optimize the bias-variance tradeoff for the ATE [70, 71]. To estimate nuisance quantities, TMLE uses a cross-validation-weighted ensemble of ML algorithms (e.g. penalized regression, random forest, gradient boosted trees) and "cross-fitting." Cross-fitting bears some similarities to cross-validation in that "out-of-sample" data, or partitions of the data not used to fit models, are utilized to reduce overfitting. In addition, cross-fitting allows one to avoid proving complicated Donsker conditions for asymptotic normality [72]. Finally, TMLE offers some robustness to near-violations of the positivity assumption.
Briefly, the estimation procedure of TMLE is as follows. Let D represent the assignment of a binary treatment and X a set of confounders. First, perhaps using ML, we use the confounders to fit initial models for treatment assignment and outcome: $g(D, X)$ and $Q_0(D, X)$, respectively. Following this, we combine predictions from these equations along with a fluctuation parameter $\epsilon$ to give us double robustness and local efficiency. Now, we will update the outcome equation to $Q_1(D, X)$ and then estimate the ATE using a "plug-in" estimator of the form $\frac{1}{n}\sum_{i=1}^{n}\left[Q_1(1, X_i) - Q_1(0, X_i)\right]$.
While TMLE takes a plug-in estimator approach, DML and its precursor PL utilize estimating equations, allowing them to compute both confounder and IV estimates. These methods utilize orthogonalization and the Frisch-Waugh-Lovell theorem to first perform regressions on the confounders and IVs and then use their residuals in subsequent estimating equations for the treatment effect that are "locally insensitive" to misspecification. Similar to TMLE, we utilize ML algorithms to fit the residual-creating regressions. The post-LASSO name comes from the fact that we first run LASSO regression to select variables and then refit OLS on the selected variables, which mitigates issues in LASSO variable selection. DML extends this notion of estimating nuisance quantities to more ML algorithms and includes double robustness. Asymptotic considerations, namely Donsker conditions for normality, are handled through cross-fitting.
For the IV model, we use the moment condition $E[\psi_i(\beta_0, \eta_0)] = 0$ where $\beta_0$ is the coefficient for the treatment effect and $\eta_0$ are the coefficients for the nuisance quantities based on the confounders and IVs. The score function takes the form $\psi_i(\beta, \eta) = (\rho^{y}_i - \rho^{d}_i \beta)\tilde{v}_i$, which is simply the canonical form of a generalized method of moments using $\tilde{v}_i$ as an IV. The $\rho^{y}_i$, $\rho^{d}_i$, and $\tilde{v}_i$ terms are all results of orthogonalization. For example, $\rho^{y}_i = y_i - x_i^T \theta$ where $\theta$ was estimated via PL or LASSO. DML extends the orthogonalization procedure to other ML algorithms by assuming partially linear models.
The "robustness" to misspecification, whether via unobserved confounders or regularization of relevant confounders, is through the orthogonality condition: $\partial_\eta E[\psi_i(\beta_0, \eta)]\big|_{\eta = \eta_0} = 0$ (7). Equation 7 essentially states that if there is some error in estimating the nuisance quantities, the resulting estimate will still be consistent for $\beta_0$. The authors call this "Neyman orthogonality" as the concept of "locally" robust estimates dates back to work done by Jerzy Neyman in 1959 [73]. The robustness of local insensitivity remains to be rigorously investigated in real world scenarios.
The fitting of the first stage of an IV analysis using ML and cross-fitting could result in inference more robust to weak IVs. In 2SLS, the estimated coefficients and, by extension, the error term in the second stage are correlated with those of the first, resulting in "finite sample" bias. Thus, if ML can generate better predictions than OLS, we can decrease the first-stage error. In addition, we can utilize sample-splitting to break the association between the two error terms, which reduces the impact of weak IVs [52]. We first partition the sample, for example in half, and use the first half to fit the first stage and then compute the predicted values for the treatment in the second half. The predictions for the second half are then used in the second stage. Cross-fitting expands on the idea of sample-splitting by repeating the process on the "unused" partition (i.e. the partition used in the second stage) and then pooling the results across each fold. Both DML and TMLE use the cross-fitting procedure while PL only uses sample-splitting.
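The cross-fitting idea can be sketched for a partially linear IV model: the outcome, treatment, and IV are each residualized on the confounders using models fit on the other folds, and the treatment effect is then the ratio of residual cross-products. This is a simplified illustration, not the DML or PL implementation: the function names `ols_fit_predict` and `cross_fit_iv` are hypothetical, plain OLS stands in for the ML learner, the residualized IV plays the role of the instrument in the score, and the simulated data are invented.

```python
import numpy as np

def ols_fit_predict(X_train, t_train, X_test):
    """Stand-in learner: OLS with an intercept. Any ML regressor could replace it."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef = np.linalg.lstsq(A, t_train, rcond=None)[0]
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef

def cross_fit_iv(y, d, z, X, n_folds=2, learner=ols_fit_predict, seed=0):
    """Cross-fitted, residual-on-residual IV estimate of the treatment effect."""
    n = len(y)
    folds = np.random.default_rng(seed).permutation(n) % n_folds
    res = {name: np.empty(n) for name in ("y", "d", "z")}
    for k in range(n_folds):
        test, train = folds == k, folds != k
        for name, target in (("y", y), ("d", d), ("z", z)):
            # Nuisance model is fit on the other folds, predicted on this fold.
            res[name][test] = target[test] - learner(X[train], target[train], X[test])
    # Solves sum_i (res_y - res_d * beta) * res_z = 0 for beta.
    return np.sum(res["z"] * res["y"]) / np.sum(res["z"] * res["d"])

# Hypothetical data: u is unobserved; the IV is valid conditional on X; effect = 1.5.
rng = np.random.default_rng(5)
n = 20_000
X = rng.normal(size=(n, 2))
z = 0.3 * X[:, 0] + rng.normal(size=n)
u = rng.normal(size=n)
d = X @ np.array([0.5, 0.5]) + z + u + rng.normal(size=n)
y = 1.5 * d + X @ np.array([1.0, -1.0]) + 2.0 * u + rng.normal(size=n)

beta_cf = cross_fit_iv(y, d, z, X)
```

Setting `n_folds=2` mirrors the half-and-half split described above, with each half taking a turn as the held-out partition before the residuals are pooled. With a linear learner and a linear data-generating process there is little to gain over 2SLS; the structure matters once a flexible learner is substituted.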
Theoretically, the benefits of ML and causal inference methods are clear but the extra complexity carries some caveats. There is a great danger of overfitting and spurious predictions, especially when the sample size is too low. For example, in predicting binary outcomes, one simulation study found that 20 to 50 "events" per variable were necessary for logistic regression to achieve stable prediction performance [74]. For more sophisticated methods like neural networks and random forests, instability persisted above 200 events per variable. Translating this to causal inference, because of overfitting, the external validity of ML prediction is questionable, especially when the proportion of treatment assignment is near 0 or 1 as even larger samples will be needed. Furthermore, ML methods could diminish the reproducibility of causal effects across studies and can be computationally expensive without much gain over "traditional" methods.
The in-practice performance of these ML-based causal inference algorithms is still in question. In one simulation study, Angrist and Frandsen were able to find meaningful gains in selecting confounders but not IVs using ML [13]. One interesting phenomenon they highlighted was "pretest bias": when LASSO decides whether to retain a coefficient or not, there is an implicit test being performed against some threshold contingent on the LASSO regularization parameter and the sample size. Consequently, when the true first-stage coefficients are zero or near-zero, bias in the treatment effect is introduced [75]. In the paper, sample-splitting or cross-fitting mitigated pretest bias, with ML not offering notable improvement in parameter estimation. The apparent importance of out-of-sample procedures further highlights the need for large sample sizes in sophisticated techniques.
Perhaps the work on ML and causal inference, though innovative, "buries the lede," so to speak. Far and away, unobserved confounding is the greatest threat to the validity of observational studies. Though nuisance quantities can be more flexibly modeled, the bias due to omitted variables likely dwarfs potential model-fitting issues that arise when using traditional models such as OLS. Sophisticated methods are more robust to violations in the assumptions than their traditional counterparts, but this does not remove the need to properly select confounders and IVs: evaluating the paradigm is paramount. For example, there has been some literature on estimation with some "imperfect" IVs that relax some variable selection techniques, but these methods presuppose that one has at least some valid IVs [76].
In terms of external validity, in non-linear settings, it is also unclear whether these methods estimate a marginal or conditional effect. For the IV setting, if the first stage is characterized by ML, where there is a significant amount of regularization, then the definition of the LATE can be vague. More perilous is the possibility for confounders that ensure conditional independence of the IVs to be omitted from the first stage due to regularization. In PL, one could simply choose to not shrink certain variables to zero, but for DML it is unclear how this can be prevented in more sophisticated algorithms like random forests.

Comparing approaches for applied data analysis
In this section, we offer general principles and guidance in a series of steps to help an analyst weigh the validity of the confounder approach to the IV approach. Why not utilize both approaches in the same analysis? Certainly, if the appropriate variables are available then one can use IV estimates as a sensitivity analysis of confounder adjustment estimates and vice versa. An issue, however, arises when each approach offers conflicting results such as oppositely signed effect estimates or when one is statistically significant and not the other. In these cases, one will have to judge which estimate is more desirable in terms of robustness regarding consistency, utility, and generalizability. A visualization of the heuristic in this section is presented in Fig. 6.
The first step is to assess the severity of unobserved confounding. In our view, compared to a confounder adjustment approach, the IV approach trades increased complexity and decreased interpretability for potential consistency in the setting of unmeasured confounding. Therefore, unless there is evidence to suggest that the confounder approach will suffer from notable unobserved confounding, the more straightforward approach should be pursued. In order to evaluate unobserved confounding, one may identify confounders separate from Furthermore, one should postulate the magnitude of unobserved confounding as there could be scenarios where the lion's share of variation due to confounding can be captured by a few observed confounders. For example, a rich electronic health record dataset may not be missing important confounders of interest whereas an insurance claims dataset is frequently missing valuable medical information. In addition, because the IV approach largely builds upon the confounder approach by incorporating confounders in the first and second stages, identifying confounders remains important to help gauge the validity of potential IVs. When performing this step, one should keep in mind how conditioning on certain confounders may affect interpretability and reproducibility. Further, one may consider the potential for treatment effect heterogeneity as this also affects the generalizability of confounder methods via effect modification, keeping in mind that omitted confounders could also be effect modifiers [77]. In reviewing treatment effect heterogeneity, it also may be a good time to weigh the usefulness of marginal versus conditional treatment effects for the purposes of the study at hand. This will assist in interpreting regression adjustment and IPTW output.
In the second step, one should identify whether there are any potential IVs in the paradigm and then in the dataset. Even if a confounder approach will be used, adjusting for IVs can amplify inconsistency. Theoretically and empirically, one should explore the strength of the IVs. From this, we can hypothesize the degree of invalidity of the selected IVs in terms of the independence assumption. The order is important: if the IVs are strong but slightly invalid then the degree of inconsistency can still be better than a confounder approach; if the IVs are seemingly weak but valid then they can potentially be salvaged by more sophisticated algorithms. Just how invalid an IV can be to still be better than confounder methods is a developing topic of research. Of note, Pearl (2010) investigated this question with one confounder and one imperfect IV using a threshold of when to choose the candidate variable as a confounder [39]. Nevertheless, the interpretability of this threshold could be improved to better accommodate applied analyses and does not consider the potential results from an IV estimation procedure such as 2SLS.
If there is notable treatment effect heterogeneity per the previous step, one should describe the subpopulation characterized by the LATE and any potential differences between the LATE and ATE. One should also justify whether this subpopulation is worth studying in the first place. If the subpopulation is too obscure, a biased confounder approach is perhaps of more scientific value.
The third and final step is to examine other relevant characteristics of the dataset: generally, the sample size and, if the treatment is binary, the distribution of the proportion of individuals that received the treatment. Large sample sizes cannot salvage confounder-based estimates in the presence of a large amount of unobserved confounding but can greatly help the IV approach. Notably, IV methods have larger standard errors for the treatment effect compared to a confounder method like OLS, which is only exacerbated by weak IVs. In some cases, one could argue that IV methods could produce Type II error rates so large that the analysis is not worth pursuing. Therefore, not only will a larger sample size create a more precise estimate and lessen finite sample bias but, in addition, it will open the door to the reliable use of more sophisticated methods like ML and cross-fitting. Accordingly, we may strengthen the first stage and decrease sensitivity to exclusion restriction violations.
For a binary treatment, the distribution of the proportion of the population that receives the treatment is important both for the validity of the positivity assumption for propensity score-based methods and for the validity of using an LPM or 2SRI for the IV first stage. Probabilities near 0 or 1 are an issue for both approaches but it remains to be investigated whether positivity violations cause more inconsistency in confounder methods when compared to 2SLS. Under near-positivity violations, we may also run into concerns surrounding Type II error as we would for IV methods. We note these notions are particular to the scenario but sample size does not necessarily mitigate the effects of near-positivity violations under extreme imbalance of treatment assignment.
A notable data-driven diagnostic of the relative performance of confounder adjustment methods versus IV methods under observed confounding (recall that unobserved confounding can also violate the independence assumption of an IV) is the bias ratio [78]. Suppose we have the OLS in Eq. 8: If we have a binary IV, we can derive the bias from omitting the confounder X 1 for OLS and 2SLS as We can take the ratio of Eqs. 9 and 10 to obtain a bias ratio where a value greater than 1 implies that the 2SLS estimate is more sensitive to an omitted confounder than the OLS estimate. If we assume that unobserved confounding is correlated with the observed confounder that we intentionally omit, then we gain information about each method's sensitivity to unobserved confounding. Improvements include adding variance to the bias ratio to represent uncertainty in estimates and considering the magnitude of β 2 in our ratio [56,79].

Illustrative example: off-label donepezil for MCI patients
We now turn to our applied data analysis surrounding the off-label use of donepezil, a cholinesterase inhibitor, in MCI patients. We will pursue the causal inference analysis steps outlined throughout the paper: first, we will outline the scientific paradigm of the analysis and define confounders and instruments; next, we will examine our available dataset, its ability to answer the causal question of interest, and its limitations with respect to this goal; lastly, we will compare and contrast confounder and IV methodologies with respect to causal effect estimates.
We analyze data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), an observational dataset that longitudinally tracks cognitively normal, MCI, and AD volunteers aged 55-90 with good general health. Among other information, the data includes both general health and specific neurocognitive data collected roughly every six months until the patient leaves the study [80]. Our dataset contains patients from several ADNI recruitment waves beginning in October 2004 until May 2021 when the data was extracted for analysis. Of note, those already on cholinesterase inhibitors such as donepezil are allowed into ADNI if they have been stable on the medication for at least 12 weeks prior to entry.

Study paradigm
Our goal is to estimate the difference between donepezil users' and non-users' two-year change from baseline in ADAS-cog. Furthermore, we are potentially susceptible to confounding by indication, which warrants using the confounder and IV approaches to isolate the causal estimand of interest.
To identify confounders, we begin by listing potential general predictors of ADAS-cog scores such as age, sex, family history of neurodegenerative disease, socioeconomic status, physical disabilities, race, and comorbidities. In addition, we can list particular biological attributes: the number of APOE4 alleles and cerebrospinal fluid (CSF) measurements of phosphorylated tau (P-tau) and amyloid-β (Aβ). Except for physical disabilities, all of the aforementioned variables could potentially influence the assignment of donepezil and, hence, we consider them confounders. A hypothesized DAG of causal relationships with a few of the potential confounders is presented in Fig. 7.
If there are differential prescriber attitudes to donepezil then physician and facility prescribing patterns could be used as IVs [33]. A similar possible IV is the time since Food and Drug Administration (FDA) approval of donepezil for AD patients in 2004. In the roughly two decades since approval, a large volume of research has been conducted on the MCI construct and donepezil in MCI patients. As a result, the general perspectives of practitioners may have changed over time. Of course, we must assume the IV is monotonic; that is, attitudes towards donepezil use are monotonically more favorable or less favorable over time.
Another potential category of IVs is related to contraindications such as bradycardia or gastrointestinal disorder, which may make off-label prescription unfavorable relative to benefits. Importantly, we should note that contraindication IVs are sometimes related to the general health of a patient, which is, in turn, related to cognitive decline. In this case, we should be particularly cautious in utilizing certain contraindications as IVs and should only do so if we can adjust for most of the related confounders.
Identifying possible IVs allows us to examine whether the LATE is of scientific interest. All of the IVs previously mentioned essentially capture variation caused by differential physician prescribing attitudes. Therefore, under these IVs, a complier is a patient that follows the judgments of the physician and the LATE would be this population's treatment effect. The resulting estimate for the complier treatment effect could be useful because a formal analysis using this LATE to evaluate donepezil off-label practice intends to influence guidelines for physicians, which are constructed with the assumption of patient compliance to physician advice. Defiers for these IVs are unlikely as individuals usually follow physicians' judgments regarding risks and benefits.

Cohort selection
A flowchart of the selection criteria is depicted in Fig. 8. We ascertained off-label donepezil prescription from self-reported pharmaceutical use and commencement dates collected at every visit in ADNI. Using keyword search, we first found patients who were prescribed donepezil and then used diagnosis data to identify prescriptions during MCI. Patients who had never reported use of donepezil were considered members of the control group. Furthermore, individuals who received donepezil while having AD were also used as controls, but their data was artificially censored at the visit before the first donepezil prescription.
For the treatment group, the start date was the self-reported commencement of donepezil and we required that the baseline measurement of ADAS-cog be from the closest visit within six months of this start date. If no such visit existed, these subjects were dropped from the study. Under an intent-to-treat principle [81], if we were able to observe a patient beginning donepezil, then they remained in the treatment group. The end date was computed by adding two years to the start date and taking the closest visit between 21 and 30 months after the start date. Once again, if no such visit existed, then the subjects were dropped. The control group start date was the first visit after MCI diagnosis. For many, this was their ADNI entry visit. As for the treatment group, the end date was computed by adding two years to the start date and applying the same 21 to 30 month requirement.
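The visit-matching rules above can be sketched as follows. This is an illustrative sketch, not the code used for the analysis: it measures visit times in months since ADNI entry, assumes "within six months" means six months on either side of the start date, and assumes the follow-up visit is the one closest to 24 months within the 21 to 30 month window. The function names are hypothetical.

```python
def closest_visit(visit_months, target, lower, upper):
    """Return the visit time closest to `target` within [lower, upper], else None."""
    eligible = [m for m in visit_months if lower <= m <= upper]
    if not eligible:
        return None
    return min(eligible, key=lambda m: abs(m - target))


def analysis_window(visit_months, start_month):
    """Pick the baseline and two-year follow-up visits for one subject.

    Returns (baseline, followup) in months, or None if the subject
    would be dropped because either visit is missing.
    """
    baseline = closest_visit(visit_months, start_month,
                             start_month - 6, start_month + 6)
    followup = closest_visit(visit_months, start_month + 24,
                             start_month + 21, start_month + 30)
    if baseline is None or followup is None:
        return None
    return baseline, followup
```

For a subject with visits every six months who starts donepezil five months after entry, `analysis_window([0, 6, 12, 18, 24, 30, 36], 5)` selects the month-6 visit as baseline and the month-30 visit as follow-up.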
Mimicking the Peterson et al. 2005 trial, on top of the ADNI inclusion criteria, we applied further inclusion-exclusion criteria from this trial depending on the availability in the data. Specifically, we only included those with an MMSE of 24-30 at baseline and excluded anyone with self-reported bipolar disorder, schizophrenia, suicidal ideation, or psychosis. Applying these criteria resulted in 134 donepezil subjects and 483 control subjects.

Examining the data: confounders and IVs
Considering Fig. 7, which depicts major confounders, we hypothesize that there are notable unobserved confounders. For example, we cannot adjust for socioeconomic status (SES) nor family history. SES is not in the dataset while family history contains too many missing values to impute (approximately 93% before inclusion criteria). We have information on comorbidities via three indicator variables: the presence of cardiovascular disease, other neurological disease, and renal disease. These indicators are limited as they fail to capture severity and not all the conditions these categories include are relevant. Overall, we judge that the impact of these missing or low-quality confounders is not too large with the exception of family history, which is likely asked during a doctor's visit.
Another potential area of unobserved confounding could come from the lack of historical information on a patient: we do not have neurocognitive test scores on these patients before entry into ADNI, but a prescriber may have this information and consider it before making a decision. Most subjects' study baseline visits are either close to or at ADNI entry, which precludes the calculation of any useful measure of history from the data, such as a first-order trend. In addition, there are many difficult-to-measure evaluations of patient well-being that stem from the intuition of the attending clinician. For observed variables that may serve as confounders, we have demographic information on sex, age, years of education, and APOE4 status. One necessary confounder induced by our sampling scheme is the length of time, in years, that the patient was in ADNI before the start of the analysis. This is associated with the outcome because the longer someone has MCI, the more opportunity there is for cognitive decline. Moreover, time in ADNI is associated with treatment assignment as most of the start dates for the control group were entry into ADNI while the start dates of the donepezil group tended to be several months into ADNI (Table 3).
We utilize CSF measurements for P-tau and Aβ but only at ADNI entry because missingness was too high otherwise (36.3% for both variables' entry measurements and 84.3% for all recorded measurements). We believe that for individuals with start dates further from baseline, combining entry P-tau and Aβ measurements with time in ADNI is sufficient to extrapolate to the study baseline.
Overall, while the observed confounders serve as useful proxies of the unmeasured components, we judge that there still remains unobserved confounding by indication. Therefore, it is warranted to investigate the IV approach. Unfortunately, ADNI did not lend itself to intuitive and high-quality IVs. Nevertheless, we utilized time, in years, since FDA approval of donepezil and contraindications in the form of indicator variables encompassing cardiovascular issues (such as bradycardia and sick sinus syndrome), gastrointestinal disorder, and asthma. More details about these IVs, including investigations of strength and validity, can be found in the supplementary appendix. Notably, these IVs were weak but the LATE is scientifically meaningful for reasons discussed earlier.
We note that the weakness of the IVs combined with a small sample size cautions us against using IV methods. However, for the purposes of comparison to confounder methods, we will still fit models with these IVs. Lastly, the proportion of individuals in the dataset who took the treatment is not extreme, so the use of the LPM in the IV methods is valid.

Statistical methodology
Before applying selection criteria, ADAS-cog, MMSE, Aβ at ADNI baseline, and P-tau at ADNI baseline were missing for roughly 30% of the observations. The comorbidity and contraindication (IV) indicators were also missing but only for 0.4% of the observations. These values were imputed using multiple imputation via chained equations (MICE), resulting in five different datasets. Specifically, to account for within-subject correlation, we performed predictive mean matching using a linear mixed model with the fixed effects of time in ADNI, clinical dementia rating scale sum of boxes, MMSE, ADAS-cog, age at ADNI entry, sex, APOE4 count, whether they had taken donepezil, the comorbidity indicators, Aβ at ADNI baseline, and P-tau at ADNI baseline. We included a random intercept and a random slope on time in ADNI.
To obtain a "baseline" estimate of the treatment effect without any adjustment of confounders, we regressed the change in ADAS-cog on treatment assignment. Then, we utilized the variables in Table 3, with the exception of starting ADAS-cog, as confounders for OLS, IPTW, TMLE, and DML. For IPTW, the propensity score was computed using logistic regression. Diagnostics are reported in the appendix. For TMLE, we utilized five-fold cross-fitting (5) and random forests to fit nuisance quantities. For DML, random forests were also used for fitting models involving nuisance quantities.
The IV analyses included the same confounders and the aforementioned IVs. For models, we fit 2SLS, PL regularizing the first stage, and DML that used random forests for nuisance quantities. To compare the strength of the first stage, pooled F-statistics were computed on the first-stage OLS involving all the IVs and confounders and on the first-stage OLS involving only the IVs and confounders selected by PL. Estimates across imputed datasets were combined using Rubin's rules. All analysis was conducted using R Version 4.0.3.
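For readers unfamiliar with Rubin's rules, the pooling step can be sketched as below. This is a minimal Python illustration rather than the R code used in the analysis; the function name is hypothetical, and it returns only the pooled point estimate and standard error (the degrees-of-freedom adjustment used for inference is omitted).

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool per-imputation estimates and standard errors with Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need matching estimates and standard errors, m >= 2")
    pooled = mean(estimates)
    # Average within-imputation variance.
    within = mean([se ** 2 for se in std_errors])
    # Between-imputation variance of the point estimates (divisor m - 1).
    between = variance(estimates)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)
```

The pooled standard error is never smaller than the average within-imputation standard error, which reflects the additional uncertainty due to the missing data.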

Results
The results, presented in Table 4, demonstrate a clear divide between the confounder and IV approaches. While the confounder methods found a point estimate that suggests donepezil has a deleterious effect on MCI patients, the IV methods produce an attenuated but beneficial effect estimate. Taking into account the standard errors, however, the IV methods cannot rule out a null effect of donepezil.
Among the confounder methods, differing levels of efficiency led to some methods attaining statistical significance using a level 0.05 test while others did not (with the exception of the unadjusted OLS). TMLE yielded the lowest estimated standard error. In contrast, DML 1 with random forests had a large standard error, possibly due to the small sample size and less efficient use of the data in cross-fitting. When we used DML 1 with logistic regression and OLS to estimate nuisance quantities, the point estimate and standard error decreased to be closer to those of the other confounder methods (Estimate: 1.48, SE: 1.16). Similarly, substituting linear models for random forests in TMLE increased the standard error but did not modify the point estimate (Estimate: 1.33, SE: 0.99); as a result, the estimate became statistically insignificant (p = 0.178). OLS and IPTW had slightly different results, with IPTW having a larger point estimate (i.e. a more harmful effect) and a larger standard error. The different point estimates could be due to slight treatment effect heterogeneity while the lower precision of the IPTW estimate could be due to near-positivity violations (see Appendix).
Regarding PL, we originally intended to utilize the default "data-driven" selection in the authors' R package (named hdm) for λ, the regularization parameter in the first-stage LASSO [82]. However, the package-selected value of λ led to all IVs being eliminated from the first stage and, subsequently, unreliable results. Figure 9 shows that at λ = 0.075, when the instruments were all excluded, the point estimate and standard errors change notably. Noting that the results largely remained the same whenever at least one IV was retained, we reported the results from λ = 0.025. Across all imputed datasets, the first-stage LASSO with λ = 0.025 selected time in ADNI, baseline Aβ, and time since FDA approval. For two imputed datasets, LASSO retained baseline P-tau and, in one of these two datasets, asthma was additionally retained. Selecting variables strengthened the first stage, with the PL first-stage F-statistic being much higher than that of 2SLS (pooled F = 33.3 and pooled F = 8.5, respectively).
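To make the first-stage F-statistic concrete, the sketch below computes it in the simplest possible case: one instrument and no covariates, where F equals (n - 2) r² / (1 - r²) with r the sample correlation between instrument and treatment. This is a simplification for illustration; the pooled F-statistics reported above come from first stages that also include confounders, and the function name is hypothetical.

```python
def first_stage_f(instrument, treatment):
    """First-stage F-statistic for a single instrument with no covariates."""
    n = len(instrument)
    mean_z = sum(instrument) / n
    mean_a = sum(treatment) / n
    s_zz = sum((z - mean_z) ** 2 for z in instrument)
    s_aa = sum((a - mean_a) ** 2 for a in treatment)
    s_za = sum((z - mean_z) * (a - mean_a)
               for z, a in zip(instrument, treatment))
    r_squared = s_za ** 2 / (s_zz * s_aa)
    # Undefined when the instrument predicts treatment perfectly (r_squared == 1).
    return (n - 2) * r_squared / (1 - r_squared)
```

Because the statistic depends on both the correlation and the sample size, a weak instrument can only attain a large F with a correspondingly large sample.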
Because DML 2 ostensibly builds upon PL with better practices and more sophisticated ML techniques, we would expect similar results, but this was not the case. The point estimate was the most negative and most extreme besides the unadjusted OLS. One explanation could be related to Fig. 9, where "over-regularization" of the first stage produces extreme results. When we replaced random forests with OLS in DML 2, we obtained results more similar to the other IV methods (Estimate: 0.154, SE: 2.712), suggesting a sensitivity of DML point estimates to the choice of the prediction model in both approaches. We were not able to fit DML with LASSO as the authors' official package (dml) did not support non-numeric covariates specifically for the LASSO algorithm.

Takeaways from applied example
In this analysis, we found that the two different approaches presented two different conclusions. Contextualizing these results, the Peterson et al. clinical trial found a -0.27 point change in ADAS-cog after 2 years for the donepezil group compared to placebo, which failed to attain statistical significance (95% CI: -1.13 to 0.59 points) [1] and was similar to the IV estimates. Meanwhile, our confounder-based estimates mirror a similar analysis using ADNI by Schneider, Insel, and Weiner conducted in 2011 [5]. Using a linear mixed model, the authors found a statistically significant increase in the rate of decline in donepezil users compared to control subjects. Using this slope to calculate the difference from baseline after two years, we arrive at a comparable estimate of 1.56.
Adding the more sophisticated ML-based algorithms did not show notable changes over traditional methods and even created a few dilemmas. For PL, the default algorithm only included baseline Aβ, selecting out all IVs likely due to their weakness. This was "a feature, not a bug": the first-stage F-statistic was indeed the highest at 60.94, indicating a "stronger" first stage. But without an IV in the first stage, the results are theoretically meaningless, which cautions against a purely "greedy" approach to fitting the first stage.
Another issue with PL was that, across imputed datasets, the variables selected for the first stages were different. Thus, both the confounder and IV estimands across imputed datasets are not exactly the same. For example, the estimated LATE from selecting only time since FDA approval as an IV was combined with the estimated LATE using both time since FDA approval and asthma. If this is the case, the Rubin's rules pooled estimates are actually an average of different LATEs, which is ambiguous. This is yet another problem with a "greedy" approach to the first stage, and it is likely that a similar process occurred for TMLE and DML with random forests.
One interesting discussion surrounds the choice of hyperparameters for TMLE and DML, particularly which ML algorithms should be selected given that the results changed depending on the algorithm chosen: TMLE-derived results became insignificant while the DML 2 point estimate flipped signs. This raises concerns about replicability issues that may arise from an arbitrary choice of ML algorithm. The criteria for selecting ML algorithms are presently unclear from the authors of both TMLE and DML. In our case, the choice of random forest was made before any results were computed.
This analysis contains several limitations. Firstly, the selection criteria were not exactly the same as those of Peterson et al. and Schneider et al., so the populations in which we were estimating the effect of donepezil are slightly different. The exclusion of MCI individuals from ADNI who have not been stable on donepezil skews the results towards the null as these individuals likely experienced harm from donepezil but were missing from the data. In addition, the control group in our analysis is aware they are not getting donepezil whereas, in a trial, the control group received a placebo in a blinded fashion. The selection of the baseline and two-year measurement dates could be improved; instead, we could have used longitudinal modeling and modeled the rate of decline in ADAS-cog. Unfortunately, the more sophisticated ML methods are not yet equipped to handle clustered data. Even so, the main purpose was to compare our results to Peterson et al., which utilized 24-month change. Starting and ending dates often were missing the day of the month, leading to inaccuracy in measuring exact 24-month changes. For the instrumental variable strategy, we were relying on self-reported health issues, a reporting process that suffers from recall bias. Most likely there were omitted or misdiagnosed conditions, which meant that our IVs were much weaker than they could have been given accurate medical history. Lastly, dosage of donepezil and compliance could be two critical nuances in evaluating treatment effects that we were not able to measure here. As an example of compliance, it could be that the already more severe MCI subjects would also have issues remembering to take their medication. Despite these limitations, the variation in estimands and resulting estimates between confounder adjustment and IV approaches remains of interest.

Conclusions
In this paper, we have outlined a paradigm for isolating causal effects in observational studies that centers around two approaches: confounders and IVs. We began by defining each approach and its often untestable assumptions. We then discussed a set of heuristics for executing the approaches from identifying relevant variables to examining external validity. In addition, we highlighted that the two approaches often overlap (e.g. potential bias amplification by IVs used as confounders) and, thus, that one approach cannot be undertaken without some consideration of the other. Following this, we discussed three popular methodologies, OLS, IPTW, and 2SLS, for conducting statistical modeling along the confounder and IV approaches, including comparing OLS to IPTW and why causal inference modeling should not simply be viewed as a prediction task. Further, we explored the implications of violations in the assumptions of each of these methodologies and how to operate when these assumptions could be violated. After touching upon causal inference in non-linear scenarios and more flexible models, we outlined a set of steps to consider when weighing each approach. The principles and processes laid out in the above portions culminated in the applied example surrounding off-label donepezil where each approach provided different challenges and, ultimately, conflicting estimates.
A natural question to ask is: which approach was better at isolating the causal effect of interest? Of course, we do not have knowledge of the true causal effect in the general population, so it is impossible to have a definitive answer to this question. This mostly owes to the untestable assumptions discussed throughout the paper such as proving we have accounted for virtually all confounding and that we have valid IVs. Even then, the confounder and IV estimates may be targeting different causal estimands and, furthermore, the process of outlining the scientific paradigm and causal mechanisms (e.g. identifying IVs) included many conjectures. Nevertheless, for sake of argument, we can speculate on the above question by judging potential scenarios in our analysis and how they may impact each approach.
In surmising what the true ADNI causal effect could be, we can first consider the Peterson et al trial as we know this estimate is, on average, unbiased for the treatment effect in the trial population. But are the populations and usage of donepezil in ADNI at the time we obtained the data similar to that of the RCT? On one hand, recruitment criteria for ADNI are akin to MCI clinical trials and we imposed extra inclusion-exclusion criteria similar to Peterson et al. On the other hand, there may be cohort effects from 2005 to the present as the understanding of MCI has developed since then and, further, we cannot fully account for compliance and concomitant medicines.
Certainly, if we assumed the clinical trial was a good benchmark for our study, the IV methodology would be the better performer. However, we must weigh the fact that the lack of high-quality IVs and low sample size likely led to large finite sample bias and standard errors. The difficulty in finding IVs underscores the fact that confounders are much more intuitive to source than IVs. Yet, both are equally important as paths to consistent treatment effects, and, as such, one aim of this paper is to increase awareness and education of IVs such that they can be identified and investigated as easily as confounders.
Considering the available observational study evidence, particularly Schneider, Insel, and Weiner's ADNI study, we can pose two questions. Firstly, as in the previous paragraph, how different is the population of our study? Secondly, seeing that our confounder results are similar, can we postulate sources of unobserved confounding and hypothesize how these sources have skewed our results? To answer the first question, because we have more recent data, we can re-perform our analysis on the subset of ADNI that Schneider had (they pulled their data in May 2009) to observe how the results changed. Doing this, we observed similar observational study estimates. But this is far from proving that there is no cohort effect that makes our populations irreconcilable.
For the second question, one possible explanation for confounding by indication is that more severe patients receive donepezil, which can skew a confounder estimate that fails to fully account for this towards one that suggests harm. However, we are limited in this line of thinking because of the difficulty of making accurate statements regarding how a physician decides if a patient is "severe enough" to be prescribed donepezil. If we were to suppose the doctor looks at the steepness of the rate of decline, then a historical measure such as a first-order trend of their cognitive test scores could mitigate some confounding. This highlights the need for more research to specifically quantify factors contributing to physician decision making in specific disease areas.
To conclude, our goals with this paper were, firstly, to highlight the key differences between the confounder and IV approaches and estimates, and, secondly, to lay out a set of principles and heuristics that can help an analyst choose the best tool under the potential violation of untestable causal assumptions. The key is to favor the approach that one has judged to violate the assumptions to a lesser degree than the other. Of course, if there are assumptions we can do nothing about, one could decide to simply not pursue causal estimates from observational studies. Yet, this would discount the compelling scientific reasons to conduct observational studies in health research such as for hypothesis generation. Therefore, skeptical yet practical discussions of how to best move forward and avenues of further development, such as the ones in the paper, are of utmost necessity.